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Abstract 



Regularization of ill-posed linear inverse problems via l\ penalization has been pro- 
posed for cases where the solution is known to be (almost) sparse. One way to obtain 
the minimizer of such an l\ penalized functional is via an iterative soft-thresholding 
algorithm. We propose an alternative implementation to ^-constraints, using a gra- 
dient method, with projection on ^i-balls. The corresponding algorithm uses again 
iterative soft-thresholding, now with a variable thresholding parameter. We also pro- 
pose accelerated versions of this iterative method, using ingredients of the (linear) 
steepest descent method. We prove convergence in norm for one of these projected 
gradient methods, without and with acceleration. 

1 Introduction 

Our main concern in this paper is the construction of iterative algorithms to solve inverse 
problems with an ^-penalization or an ^-constraint, and that converge faster than the 
iterative algorithm proposed in [2T] (see also formulas ([7]) and ([8]) below). Before we get 
into technical details, we introduce here the background, framework, and notations for our 
work. 

In many practical problems, one cannot observe directly the quantities of most interest; 
instead their values have to be inferred from their effect on observable quantities. When 
this relationship between observable y and interesting quantity / is (approximately) linear, 
as it is in surprisingly many cases, the situation can be modeled mathematically by the 
equation 



where A is a linear operator mapping a vector space IC (which we assume to contain all 
possible "objects" /) to a vector space TL (which contains all possible data y). The vector 
spaces K, and TL can be finite- or infinite-dimensional; in the latter case, we assume that /C 
and TL are (separable) Hilbert spaces, and that A : K —* TL is a bounded linear operator. 
Our main goal consists in reconstructing the (unknown) element / £ tC, when we are 
given y. If A is a "nice" , easily invertible operator, and if the data y are free of noise, then 
this is a trivial task. Often, however, the mapping A is ill-conditioned or not invertible. 
Moreover, typically JT]) is only an idealized version in which noise has been neglected; a 
more accurate model is 



in which the data are corrupted by an (unknown) noise. In order to deal with this type 
of reconstruction problem a regularization mechanism is required |30j. Regularization 



y = Af , 




y = Af + e 



(2) 



techniques try, as much as possible, to take advantage of (often vague) prior knowledge 
one may have about the nature of /. The approach in this paper is tailored to the case 
when / can be represented by a sparse expansion, i.e., when / can be represented by a 
series expansion with respect to an orthonormal basis or a frame |20t [TT] that has only 
a small number of large coefficients. In this paper, as in [21] . we model the sparsity 
constraint by adding an l\— term to a functional to be minimized; it was shown in [21] 
that this assumption does indeed correspond to a regularization scheme. 

Several types of signals appearing in nature admit sparse frame expansions and thus, 
sparsity is a realistic assumption for a very large class of problems. For instance, natural 
images are well approximated by sparse expansions with respect to wavelets or curvelets 

HUE]. 

Sparsity has had already a long history of successes. The design of frames for sparse 
representations of digital signals has led to extremely efficient compression methods, such 
as JPEG2000 and MP3 [39J. A new generation of optimal numerical schemes has been 
developed for the computation of sparse solutions of differential and integral equations, 
exploiting adaptive and greedy strategies [121 [131 [TU [TTl [18]. The use of sparsity in 
inverse problems for data recovery is the most recent step of this concept's long career of 
"simplifying and understanding complexity", with an enormous potential in applications 
[2[S[ig[I3[in[22[23[2g[H In particular, the observation that 

it is possible to reconstruct sparse signals from vastly incomplete information just seeking 
for the ^i-minimal solutions El l26l W\\ has led to a new line of research called sparse 
recovery or compressed sensing, with very fruitful mathematical and applied results. 



2 Framework and Notations 

Before starting our discussion let us briefly introduce some of the notations we will need. 
For some countable index set A we denote by t v = £ P (A), 1 < p < oo, the space of real 
sequences x = (xa)agA with norm 




1 < p < oo 



and ||x||oo := sup^gA \x\\ as usual. For simplicity of notation, in the following || • || will 
denote the ^2-norm || • ||2- 

As is customary for an index set, we assume we have a natural enumeration order for the 
elements of A, using (implicitly) a one-to-one map Af from A to N. In some convergence 
proofs, we shall use the shorthand notations |A| for M(X), and (in the case where A is 
infinite) A — > oo for M(X) — > oo. 

We also assume that we have a suitable frame : A € A} C /C indexed by the countable 
set A. This means that there exist constants c%,C2 > such that 

aWfWl < E K/>^>! 2 ^ C 2ll/HL fOT a11 / e K - (3) 

AeA 

Orthonormal bases are particular examples of frames, but there also exist many interesting 
frames in which the ip\ are not linearly independent. Frames allow for a (stable) series 
expansion of any / G JC of the form 

AeA 
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where x = (xa)agA G ^2 (A). The linear operator F : £2 (A) - ► £ (called the synthesis map 
in frame theory) is bounded because of ([3]). When {^a : A 6 A} is a frame but not a basis, 
the coefficients x\ need not be unique. For more details on frames and their differences 
from bases we refer to [11] . 

We shall assume that / is sparse, i.e., that / can be written by a series of the form (|4|) 
with only a small number of non- vanishing coefficients x\ with respect to the frame 
or that / is compressible, i.e., that / can be well-approximated by such a sparse expansion. 
This can be modeled by assuming that the sequence x is contained in a (weighted) £\(A)- 
space. Indeed, the minimization of the ^i(A) norm promotes such sparsity. (This has been 
known for many years, and put to use in a wide range of applications, most notably in 
statistics. David Donoho calls one form of it the Logan phenomenon in [28] - see also [27] 
-, after its first observation by Ben Logan [37].) These considerations lead us to model 
the reconstruction of a sparse / as the minimization of the following functional: 

F T {x) = \\Kx-y\\ 2 n + 2T\\x\\ 1 , (5) 

where we will assume that the data y and the linear operator K := A o F : £2 (A) — > TC 
are given. The second term in ([5]) is often called the penalization or regularizing term; the 
first term goes by the name of discrepancy, 

D(x) := \\Kx - yf n . (6) 

In what follows we shall drop the subscript Ji, because the space in which we work will 
always be clear from the context. We discuss the problem of finding (approximations to) 
x(t) in £2 (A) that minimize the functional ([5]). (We adopt the usual convention that for 
u £ -^2 (A) \^i(A), the penalty term "equals" 00, and that, for such u, F T (u) > F T (x) for 
all x £ ^i(A). Since we want to minimize F T , we shall consider, implicitly, only x £ ^i(A).) 
The solutions /(t) to the original problem are then given by f(r) = Fx(r). 

Several authors have proposed independently an iterative soft-thresholding algorithm 
to approximate the solution x(r) pTEJ E2J EH [29]. More precisely, x(t) is the limit of 
sequences x^ defined recursively by 



x (n+l) 



W + K * y _ K*Kx in) ] , (7) 



starting from an arbitrary x^ , where S T is the soft-thresholding operation defined by 
^t(x)\ = S T (x\) with 

X — T X > T 

S T (x) = { \x\ < t . (8) 

X + T X < —T 



Convergence of this algorithm was proved in [21J. Soft-thresholding plays a role in 
this problem because it leads to the unique minimizer of a functional combining £2 and 
£\— norms, i.e., (see [10], [2Tj) 

§ r (a) = arg min (||x — a\\ 2 + 2t||x||i) . (9) 

We will call the iteration ([7]) the iterative soft-thresholding algorithm or the thresholded 
Landweber iteration. 
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(a) (b) 

Figure 1: The path, in the ||x||i vs. \\Kx — y|| 2 plane, followed by the iterates x^ n > of 
three different iterative algorithms. The operator K and the data y are taken from a 
seismic tomography problem [38] (see also Section [6|). The boxes (in both (a) and (b)) 
correspond to the thresholded Landweber algorithm. In this example, iterative thresholded 
Landweber ([7]) first overshoots the t\ norm of the limit (represented by the fat dot), and 
then requires a large number of iterations to reduce ||x^ n ^||i again (500 are shown in this 
figure). In (a) the crosses correspond to the path followed by the iterates of the projected 
Landweber iteration (|10p; in (b) the triangles correspond to the projected steepest descent 
iteration (jlip : in both cases, only 15 iterates are shown. The discrepancy decreases more 
quickly for projected steepest descent than for the projected Landweber algorithm. How 
this translates into faster convergence (in norm) is discussed in Section El The solid line 
corresponds to the limit trade-off curve, generated by x(t) for decreasing values of r > 0. 
The vertical axes uses a logarithmic scale for clarity. 

3 Discussion of the Thresholded Landweber Iteration 

The problem of finding the sparsest solution to the under-determined linear equation 
Kx = y is a hard combinatorial problem, not tractable numerically except in relatively 
low dimensions. For some classes of K, however, one can prove that the problem reduces 
to the convex optimization problem of finding the solution with the smallest t\ norm 
[261 HI [6]. Even for K outside this class, l\— minimization seems to lead to very good 
approximations to the sparsest solutions. It is in this sense that an algorithm of type (J7]) 
could conceivably be called 'fast': it is fast compared to a brute-force exhaustive search 
for the sparsest x. 

A more honest evaluation of the speed of convergence of algorithm (|7|) is a comparison 
with linear solvers that minimize the corresponding £2 penalized functional, such as, e.g., 
the conjugate gradient method. One finds, in practice, that the thresholded Landweber 
iteration (|T|) is not competitive at all in this comparison. It is, after all, the composition of 
thresholding with the (linear) Landweber iteration £( n+1 ) = x( n ) -\- K* y — K* K x^ n \ which 
is a gradient descent algorithm with a fixed step size, known to converge usually quite 
slowly; interleaving it with the nonlinear thresholding operation does unfortunately not 
change this slow convergence. On the other hand, this nonlinearity did foil our attempts 
to "borrow a leaf" from standard linear steepest descent methods by using an adaptive 
step length - once we start taking larger steps, the algorithm seems to no longer converge 
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in at least some numerical experiments. 

We take a closer look at the characteristic dynamics of the thresholded Landweber 
iteration in Figure [TJ As this plot of the discrepancy T>(x^) = \\Kx^ — y\\ 2 versus 
shows, the algorithm converges initially relatively fast, then it overshoots the 
value ||x(t)||i (where x(t) := linin^oo x^), and it takes very long to re-correct back. In 
other words, starting from x^ = 0, the algorithm generates a path {x^; n £ N} that is 
initially fully contained in the ^i-ball Br := {x € ^2 (A); < R}, with R := ||sg(t)||i. 
Then it gets out of the ball to slowly inch back to it in the limit. A first intuitive way 
to avoid this long "external" detour is to force the successive iterates to remain within 
the ball Br. One method to achieve this is to substitute for the thresholding operations 
the projection Pb h , where, for any closed convex set C, and any x, we define Fc(x) to be 
the unique point in C for which the £2 — distance to x is minimal. With a slight abuse of 
notation, we shall denote ^b r by P_r; this will not cause confusion, because it will be clear 
from the context whether the subscript of P is a set or a positive number. We thus obtain 
the following algorithm: Pick an arbitrary x^ 6 £2 (A), for example x^ = 0, and iterate 



x (n+l) = Fr 



in) + K * y _ K*K x (.nj\ ( 1Q ) 



We will call this the projected Landweber iteration. 

The typical dynamics of this projected Landweber algorithm are illustrated in Fig. 
QJa). The norm ||a;W||i no longer overshoots R, but quickly takes on the limit value 
(i.e., ||x(r)||i); the speed of convergence remains very slow, however. In this projected 
Landweber iteration case, modifying the iterations by introducing an adaptive "descent 
parameter" (3^ > in each iteration, defining x^ n+l ^ by 



x (n+l) = ¥r 



x (n) + p(n)K*(y -Kx { ^)\ , (11 



does lead, in numerical simulations, to promising, converging results (in which it differs 
from the soft-thresholded Landweber iteration, where introducing such a descent param- 
eter did not lead to numerical convergence, as noted above). 

The typical dynamics of this modified algorithm are illustrated in Fig. [T](b) , which 
clearly shows the larger steps and faster convergence (when compared with the projected 
Landweber iteration in Fig. (Ha)). We shall refer to this modified algorithm as the projected 
gradient iteration or the projected steepest descent; it will be the main topic of this paper. 

The main issue is to determine how large we can choose the successive and still 
prove norm convergence of the algorithm in £2 (A). 

There exist results in the literature on convergence of projected gradient iterations, 
where the projections are (as they are here) onto convex sets, see, e.g., [HUB] and refer- 
ences therein. These results treat iterative projected gradient methods in much greater 
generality than we need: they allow more general functionals than T>, and the convex set 
on which the iterative procedure projects need not be bounded. On the other hand, these 
general results typically have the following restrictions: 

• The convergence in infinite-dimensional Hilbert spaces (i.e., A is countable but infi- 
nite) is proved only in the weak sense and often only for subsequences; 

• In [T] the descent parameters are typically restricted to cases for which linin^oo [5^ = 
0. In [16], it is shown that the algorithm converges weakly for any choice of 



2-e 



e ' PTI 



for e > arbitrarily small. Of most interest to us is the case 
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Figure 2: For a given vector a £ £2 ; ||§7-( a ) ||i is a piecewise linear continuous and decreasing 
function of r (strictly decreasing for r < maxj |a»|) . The knots are located at {|oi|,i : 
1 . . . m} and 0. Finding r such that ||S T (a)||i = R ultimately comes down to a linear 
interpolation. The figure is made for the finite dimensional case. 

where the are picked adaptively, can grow with n, and are not limited to values 
below rrjh) this case is not covered by the methods of either [T] or [IB] . 

To our knowledge there are no results in the literature for which the whole sequence 
(a;( n ))neN converges in the Hilbert space norm to a unique accumulation point, for "de- 
scent parameters" f3^ > 2. It is worthwhile emphasizing that strong convergence is not 
automatic: in |16|. Remark 5.12], the authors provide a counterexample in which strong 
convergence fails. (This question had been open for some time.) One of the main results 
of this paper is to prove a theorem that establishes exactly this type of convergence; see 
Theorem 15.181 below. Moreover, the result is achieved by imposing a choice of > 1 
which ensures a monotone decay of a suitable energy. This establishes a principle of best 
descent similar to the well-known steepest-descent in unconstrained minimization. 

Before we get to this theorem, we need to build some more machinery first. 

4 Projections onto ^-Balls via Thresholding Operators 

In this section we discuss some properties of ^-projections onto ^i-balls. In particular, 
we investigate their relations with thresholding operators and their explicit computation. 
We also estimate the time complexity of such projections in finite dimensions. 
We first observe a useful property of the soft-thresholding operator. 

Lemma 4.1 For any fixed a S ^2 (A) an ^ f or t > 0, ||§ T (a)||i is a piecewise linear, 
continuous, decreasing function of r; moreover, if a € ^i(A) then ||So(a)||i = ||a||i and 
||S r (a)||i = for t > maxj \ai\. 

Proof: ||S r (a)||i = J2x 1^(^)1 = ^(^aI) = Z|a A |>r(l a A| - r ); the sum in the right 
hand side is finite for r > 0. □ 
A schematic illustration is given in Figure [2j 

The following lemma shows that the £2 projection P_r(o) can be obtained by a suitable 
thresholding of a. 
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Lemma 4.2 If \\a\\i > R, then the £2 projection of a on the t\ ball with radius R is given 
by P/?(a) = S^(a) where /j, (depending on a and R) is chosen such that [|S„(a)||i = R. If 
||a||i < R then P_r(o) = §o( a ) = a - 

Proof: Suppose ||a|ji > R. Because, by Lemma [4.1} ||S^(a)||i is continuous in fi and 
|jS M (a)||i = for sufficiently large fi, we can choose [i such that 1 1 (a) 1 1 1 = R. (See 
Figure El) On the other hand (see above, or pUlGI]), b = S^(a) is the unique minimizer 
of \\x — a|[ 2 + 2/i||x||i, i.e., 

||6 - a|| 2 + 2n\\b\\i < \\x - a\\ 2 + 2n\\x\\i 

for all x 7^ b. Since = R, it follows that 

Vx G Br, x 7^ b : \\b — a\\ 2 < \\x — a\\ 2 

Hence b is closer to a than any other x in Br. In other words, Pj?(a) = b = §^(a). □ 
These two lemmas prescribe the following simple recipe for computing the projection 
P j r(o). In a first step, sort the absolute values of the components of a (an O(mlogm) 
operation if #A = m is finite), resulting in the rearranged sequence (a|)^ =1 m , with 
a-e > a i+i — f° r au Next, perform a search to find k such that 



k-1 

KiWh = E (°? - 4) < ^ < E ( a *i - 4+0 = ll s < +1 ( a )lli 



1=1 1=1 

or equivalently, 

fc-l k 



Pa*(a)\\i = E I ( a *i - <R<Y J H a e- «m) = ll^ +1 («)lli; 
t=\ t=\ 

the complexity of this step is again O(mlogm). Finally, set 
v := k^ 1 (^R — ||S * (a) , and fj, := a* k + za Then 

fc 

= ^max(|aj| -yu,0) = E ( a ? "~ /*) 
ieA £=1 
fc-i 

= ^ (^* - 4) + Ari/ = ||S .(o)||i + ku = R. 
1=1 

These formulas were also derived in |35t Lemma 4.1 and Lemma 4.2], by observing that 
Pjj(o) = a-Sf(a), where 

Sf(a) = argmin(||x- a|| 2 + 2E||a;|| 0O ), ieR m . (12) 

The latter is again a thresholding operator, but it is related to an penalty term. Similar 
descriptions of the £2 projection onto £\ balls appear also in [5]. 
Finally, Fr has the following additional properties: 
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Lemma 4.3 For any x G ^2 (A), Fr{x) is characterized as the unique vector in Br such 
that 

(w-F r (x),x-Fr{x)) <0, for all w G Br. (13) 
Moreover the projection Fr is non-expansive: 

\\Fr(x)-Fr(x')\\ < \\x -x'\\ (14) 

for all x, y G £2 (A). 

The proof is standard for projection operators onto convex sets; we include it because its 
technique will be used often in this paper. 

Proof: Because Br is convex, (1 — i)P^(x) + tw 6 Br for all w G Br and t G [0, 1]. It 
follows that \\x - Pr(x)|| 2 < \\x - [(1 - t) F R (x) +tw] || 2 for all t G [0, 1]. This implies 

< -2t (w - F R (x),x - Fr(x)) + t 2 \\w - F R (x) || 2 

for all t G [0, 1]. It follows that 

(w-¥r(x),x-¥ r {x)) < , 



which proves (|13p . 

Setting w = Fr(x') in (fTB"]) . we get, for all x,x', 

(F r (x')-Fr(x),x-Fr(x)) < 
Switching the role of x and x' one finds: 

(F R (x') -¥ R (x),x' -¥ R (x')) >0 
By combining these last two inequalities, one finds: 

(F R (x') -F R (x),x'-x- Fr(x') + F R (x)) > 

or 

\\F R (x')-F R (x)\\ 2 < (F R (x')-F R (x),x' -x); 
by Cauchy-Schwarz this gives 

\\F R (x f ) -Fr(x)\\ 2 < {Fr(x')-F r (x),x' -x) < \\¥ R (xf)-V R (x)\\\\a/-x\\, 

from which inequality (|14p follows. □ 

5 The Projected Gradient Method 

We have now collected all the terminology needed to identify some conditions on the [3^ 
that will ensure convergence of the x^ n \ defined by (jlip . to xr, the minimizer in Br of 
T>{x) = \\Kx — y|| 2 . For notational simplicity we set = K*{y — Kx^). With this 
notation, the thresholded Landweber iteration ([7]) can be written as 

x (n+l) = ^.(n) + r (n)^j _ ^ 
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As explained above, we consider, instead of straightforward soft-thresholding with fixed 
r, adapted soft-thresholding operations ^^r x oo +r (rO) that correspond to the projection 
operator Fr: 

x (n+l) = ¥r ^.(n) + r (n)^j _ ^ 

The dependence of n(R,x^ + r^) on R is described above; R is kept fixed throughout 
the iterations. If, for a given value of r, R were picked such that R = R T := ||x T ||i (where 
x T is the minimizer of \\Kx — 2/|| 2 + 2r|jx||i), then Lemma 14.21 would ensure that x T = xr. 
Of course, we don't know, in general, the exact value of ||x r ||i, so that we can't use it as 
a guideline to pick R. In practice, however, it is customary to determine x T for a range of 
r-values; this then amounts to the same as determining xr for a range of R. 
We now propose to change the step into a step /3^A n ^ (in the spirit of the "classical" 
steepest descent method), and to define the algorithm: Pick an arbitrary x^ G £2 (A), for 
example x^ = 0, and iterate 

x (n+l) = Fr ^(n) + p(n) T (n)\ ^ 

In this section we prove the norm convergence of this algorithm to a minimizer xr of 
\\Kx — y\\ 2 in Br, under some assumptions on the descent parameters [3^ > 1. 

5.1 General properties 

We begin with the following characterization of the minimizers of T> on Br. 

Lemma 5.1 The vector xr G ^(A) is a minimizer ofT>(x) = \\Kx — y\\ 2 on Br if and 
only if 

F R (x R + PK*(y-Kx R ))=x R , (18) 
for any (3 > 0, which in turn is equivalent to the requirement that 

(K*(y - Kx R ),w- xr) < 0, for all w G B R . (19) 

To lighten notation, we shall drop the subscript R on xr whenever no confusion is possible. 
Proof: If x minimizes T> on Br, then for all w G Br, and for all t G [0, 1], 

V{x) < V{{1 -t)x + tw), or 
\\Kx — y\\ 2 < \\Kx — y + tK(w — x)\\ 2 , or 

< 2t(Kx-y,K(w-x))+t 2 \\K(w-x)\\ 2 . 

This implies 

(K*{y-Kx),w-x) <0. (20) 
It follows from this that, for all w G Br and for all (3 > 0, 

(x + f3K*(y-Kx)-x,w-x) <0, (21) 

By Lemma 14.31 this implies (|18p . 

Conversely, if Fr(x + j3K*{y — Kx)) = x, then for all w G Br and for all t G [0, 1]: 

\\(x + (3K*(y-Kx))-((l-t)i + tw)\\ 2 > || {x + (3K*(y - Kx)) - x \\ 2 , 
or\\f3K*(y-Kx) + t(x-w)\\ 2 > \\f3K* {y - Kx))f , 
^ 2t(3{K*{y-Kx),x-w)+t 2 \\x-w\\ 2 > 0. 
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This implies 

(K*(y-Kx),x-w}>0 or (y - Kx, K(x - w)) > 0. 
In other words: 

-\\y - Kxf - \\Kx - Kw\\ 2 + \\(y - Kx) + K{x - w)\\ 2 > 

or 

V(x) + \\K(x-w)\\ 2 < V(w). 
This implies that x minimizes T> on Br. □ 

The minimizer of T> on Br need not be unique. We have, however 

Lemma 5.2 Ifx,x are two distinct minimizers ofT>{x) = \\Kx — y\\ 2 on Br, then Kx = 
Kx, i.e., x — x € ker K. 

Conversely, ifx,x G Br, if x minimizes \\Kx — y\\ 2 and ifx — x £ keriT then x minimizes 
\\Kx — y\\ 2 as well. 

Proof: The converse is obvious; we prove only the direct statement. From the last in- 
equality in the proof of Lemma I5TT1 we obtain V{x) + \\K(x — x)\\ 2 < T>{x) = T>(x), which 
implies \\K(x — x)\\ = 0. □ 

In what follows we shall assume that the minimizers of T> in Br are not global min- 
imizers for T>, i.e., that K*(y — Kx) ^ 0. We know from Lemma 14.21 that P_R(a) can be 
computed for ||a||i > R simply by finding the value fi > such that 1 1 (a) 1 1 1 = R; one 
has then P_r(ci) = S M (a). Using this we prove 

Lemma 5.3 Let u be the common image under K of all minimizers ofD on Br, i.e., for 
all x minimizing V in Br, Kx = u. Then there exists a unique value r > such that, for 
all (3 > and for all minimizing x 

F R (x + (3K*(y - u)) = S T/3 (x + 0K*(y - u)). (22) 

Moreover, for all A £ A we have that if there exists a minimizer x such that x\ ^ 0, then 

\(K*(y-u)) x \=r. (23) 

Proof: From Lemma 14.21 and Lemma 15.11 we know that for each minimizing x, and each 
f3 > 0, there exists a unique /j,(x, f3) such that 

x = ¥ R (x + pK*{y - u)) = S M5j/3) (x + pK*(y - u)). (24) 

For x\ 7^ we have x\ = x\ + f3(K*{y — u))\ — fi(x, /3)sgn x\] this implies sgni^ = 
sgn(xA + j3(K*(y - u))\) and also that \(K*(y - u))\\ = 4/x(5,/3). If X\ = then 
\(K*(y - u))\\ < It follows that r := /j,(x,(3)/(3 = \\K*(y - u)\\oo does not 

depend on the choice of x. Moreover, if there is a minimizer x for which x\ ^ 0, then 
\(K*(y-u)) x \=r. □ 

Lemma 5.4 If, for some A G A, two minimizers x, x satisfy x\ ^ and x\ ^ 0, then 
sgn x\ = sgn x\ . 
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Proof: This follows from the arguments in the previous proof; x\ ^ implies {K*(y — 
u))\ = Tsgnx\. Similarly, x\ ^ implies (K*(y — u))\) = rsgnx A , so that sgnx A = 
sgnx A . □ 

This immediately leads to 

Lemma 5.5 For all x G Br that minimize T>, there are only finitely many x\ ^ 0. More 
precisely, 

{A G A : x x + 0} C T := {A G A : \(K*(y - u)) x \ = \\(K*(y - u))|U}. (25) 
Moreover, if the vector e is defined by 

_ Jo, A^r 

6X \ S gn((K*(y-u)) x ), A G T, ^ 
i/ien (x, e) = i? /or eac/i minimizer xofDin Br. 

Proof: We have already proved the set inclusion. Note that, since K*(y — u) G £2 (A), the 
set r is necessarily a finite set. We also have, for each minimizer x, 

(x,e) = ^x A e A 
Aer 

^ x\sgn((K*(y-u))\) 
Aer,x A ^o 

= x A sgn(x A ) = ||x||i = R. 

□ 

Remark 5.6 By changing, if necessary, signs of the canonical basis vectors, we can assume, 
without loss of generality, that e A = +1 for all A G T. We shall do so from now on. □ 



5.2 Weak convergence to minimizing accumulation points 

We shall now impose some conditions on the (3^ . We shall see examples in Section [6] 
where these conditions are verified. 

Definition 5.7 We say that the sequence {P^) n€ ^ satisfies Condition (B) with respect 
to the sequence (x( ra )) ngN if there exists uq so that: 

(Bl) /3:=sup{/3 (n) ;nGN} < 00 and inf{ (3 {n) ; n G N } > 1 
(B2) /?( n )||if(2> +1 ) -2>))|| 2 < r||x( n+1 ) -x(")|| 2 Vn > n . 

We shall often abbreviate this by saying that 'the /3^ n ' satisfy Condition (B)'. The constant 
r used in this definition is r := |jisT*ET||£ 2 ^ 2 < 1. (We can always assume, without loss of 
generality, that ||^ 2 _»^ < 1; if necessary, this can be achieved by a suitable rescaling of 
K and y.) 

Note that the choice [3^ = 1 for all n, which corresponds to the projected Landweber 
iteration, automatically satisfies Condition (B); since we shall show below that we obtain 
convergence when the (3^ satisfy Condition (B), this will then establish, as a corollary, 
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convergence of the projected Landweber iteration algorithm ()10p as well. We shall be 
interested in choosing, adaptively, larger values of in particular, we like to choose 

as large as possible. 
Remark 5.8 

• Condition (B) is inspired by the standard length-step in the steepest descent algo- 
rithm for the (unconstrained, unpenalized) functional \\Kx — y\\ 2 . In this case, one 



can speed up the standard Landweber iteration x 



(n+l) 



-(«) 



defining instead x 



(n+l) 



+ K*(y - KxW) by 



;( n ) + aK*(y - KxW), where a is picked so that it gives 



the largest decrease of \\Kx — y\\ in this direction. This gives 



a 



K*(y - Kx^)\\ 2 ] \\\KK*(y - Kx^)\\ 



In this linear case, one easily checks that a also equals 



a 



(27) 



(28) 



in fact, it is this latter expression for a (which inspired the formulation of Condition 
(B)) that is most useful in proving convergence of the steepest descent algorithm. 

• Because the definition of x( n+1 ) involves (3^ n \ the inequality (B2), which uses x^ n+1 ' 
to impose a limitation on [3^ a \ has an "implicit" quality. In practice, it may not 

be straightforward to pick appropriately; one could conceive of trying first a 

Hr( n ) || 2 

"greedy" choice, such as e.g. jj^^yjp > if this value works, it is retained; if it doesn't, 
it can be gradually decreased (by multiplying it with a factor slightly smaller than 1) 
until (B2) is satisfied. (A similar way of testing appropriate step lengths is adopted 
in [32].) 

□ 

In this section we prove that if the sequence (a^ n )) n eN is defined iteratively by (fTTI) . and 
if the used in the iteration satisfy Condition (B) (with respect to the x^), then the 
(weak) limit of any weakly convergent subsequence of (x^) n ^ is necessarily a minimizer 
of V in B R . 

Lemma 5.9 Assume \\K\\i 2 ^f{ < 1 and j3 > 1. For arbitrary fixed x in Br, define the 
functional Fp(-\x) by 



Fp(w;x) := — y\ 



\K(w - x)\\ 2 + -\\w- x\\ 2 . 

(3 



(29) 



Then there is a unique choice for w in Br that minimizes the restriction to Br of Fg(w; x). 
We denote this minimizer by Tr{(3;x); it is given by Tr((3;x) = ¥r(x + (3K*(y — Kx)). 
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Proof: First of all, observe that the functional Fg(-,x) is strictly convex, so that it has a 
unique minimizer on Br; let x be this minimizer. Then for all w G Br and for all t £ [0, 1] 



Fp(x; x) < Fp((l — t)x + tw; x 
2t 

a 



(Kx — y, K(w — x)} — (Kx — Kx, K(w — x)) + — (x — x, w — x) 

P 



t 

+ —\\w - x\\ 2 > 
P 

=> [(3{Kx — y, K(w — x)) + (x — x, w — x)] + -\\w — x\\ 2 > 

=> (x - x + (5K*(Kx - y),w - x) > 
(x + f3K*(y - Kx) - x ,w - x) < 0. 

The latter implication is equivalent to x = Pr(x + 0K*(y — Kx)) by Lemma l4~3l □ 



An immediate consequence is 

Lemma 5.10 If the x^ are defined by (jlip , and the f)^ satisfy Condition (B) with 
respect to the x^ n \ then the sequence ngN is decreasing, and 

lim ||x( n+1 ) -x (n) || =0. (30) 

n^oo 

Proof: Comparing the definition of 

x (n+l) in 

(jlip with the statement of Lemma f5,9l we see 
that x( n+1 ) = T R (/3^;x^), so that x( n+1 ) is the minimizer, for x £ Br, of F^ n ){x;x^). 
Setting 7 = i — 1 > 0, we have 

V{x {n+ ^) < V{x {n+ V) + 1 \\K{x i - n+ V -x^)f 

= \\K X ( n+ V - y\\ 2 + (1 + 7 )||^(x( n+1 ) - X^)f - \\K(x^ n+ ^ - X^)f 

< \\Kx( n+ V - y\\ 2 - \\K{x^ n+ ^ - x^)\\ 2 + -L||:r( n+1 ) - x^\\ 2 

j3( n ) 



We also have 



i^(»)(x (n+1) ;x (n) ) < Fp (n) ( X W;xW) = 2?(iW). 



Fp( n+ i) x( n+1 )) + Fg( n ) (i( n+1 ); I (n) ) 



_ T (")||2 



" — .(••■• ' ir — \\K {x^'~ ' "' — ./■• 



> 



1 ~ r i| x (n+l) _ x («)||2 > 1 ~ r il„fa+l) _^(n)||2 



This implies 



iV 



n=0 



* I~ Z) (^»)( i(fl) ; i(B) ) - F /3(« 



n=0 

N 



n=0 



< 
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Therefore, the series ^2^ =0 \\x^ n+1 ^ —x^\\ 2 converges and linin^oo ||x( n+1 ) — x^\\ = 0. □ 

Because the set {x^;n G N} is bounded in ^i(A) (x n are all in Br), it is bounded 
in £2 (A) as well (since 1 1 ge, 1 1 2 < ||a||i). Because bounded closed sets in £2 (A) are weakly 
compact, the sequence (x^) n ^n must have weak accumulation points. We now have 

Proposition 5.11 (Weak convergence to minimizing accumulation points) Ifx* 
is a weak accumulation point of (x^) n ^ then minimizes T> in Br. 

Proof: Let (x^ nj ^)j^ be a subsequence converging weakly to x&. Then for all a £ ^2 (A) 

(Kx^\a) = (x^,K*a), >{x*,K*a) = (Kx*,a). (31) 



Therefore wlim^oo Kx^ 1 ^ = Kx&. From Lemma 15.101 we have ||x < - n+1 - ) — x^ | 0, so 
that we also have wlim^oo = cc#. By the definition of x^ n+l ^ (x( n+1 ) = Pjj(x™ + 

p( n )K*(y - Kx^))), and by Lemma E2 we have, for all w E Br, 

(iW + 0^K*(y - KxW) - i (n+1) ,w - x( n+1 )) < 0. (32) 

In particular, specializing to our subsequence and taking the limsup, we have 

limsup (arte) - + (i {n ^K*{y - Kx^),w - x^ + ^) < 0. (33) 

Because ||x^ nj ^ — x^ nj ' +1 '|| — > 0, for j — > 00, and w — x^ nj+1 ^ is uniformly bounded, we have 

0, (34) 



lim I (x 

j->oo 



(ry) - x ^ +1 \w -x {n > +1) ) 



so that our inequality reduces to 

limsup P {n ^ (K*(y - Kx^),w - x^ +1 ^) < 0. 



(35) 



3^00 



By adding (3^ n ^{K*{y — Kx^ +1 '),x^ +1 ' — x^ n ^), which also tends to zero as j — > 00, 
we transform this into 



limsup /3 (n ^ (K*(y - Kx {n ^),w - x {n ^} < 0. 

bmce the ft n i' are all in [1,/?], it follows that 

limsup (K*(y - Kx^),w - x {n ^} < 0, 



or 



lim sup 



(K*y,w-x*) - (K*Kx*,w) + \\Kx {n ^\ 



<0, 



where we have used the weak convergence of x^ nj \ This can be rewritten as 



(K*(y - Kx*),w - x#) + limsup [||.Ka;fo>|| a - \\Kx*\^ 



< 0. 



(36) 

(37) 
(38) 

(39) 



Since wlim,gpj Kx^ n ^ = Kx#, we have 



lim sup 



\Kx^f - \\Kx*f 



> 0. 



We conclude thus that 

(K*(y-Kx*),w-x#) < 0, for all w£B R , 
so that x& is a minimizer of T> on Br, by Lemma 15. 11 



(40) 
□ 
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5.3 Strong convergence to minimizing accumulation points 

In this subsection we show how the weak convergence established in the preceding subsec- 
tion can be strengthened into norm convergence, again by a series of lemmas. Since the 
distinction between weak and strong convergence makes sense only when the index set A 
is infinite, we shall implicitly assume this is the case throughout this section. 

Lemma 5.12 For the subsequence (x( n ^)jgn defined in the proof of Proposition \5.11\ 

1xm j - too K(x<™'>) = Kx*. 

Proof: Specializing the inequality ()39[) to w = x*, we obtain 

<0; 

together with ||i^x # || 2 < hminf,-^ \\Kx^ || 2 (a consequence of the weak convergence of 
Kx^ toKx#), this implies lim^oo \\K(x^)\\ 2 = ||iTx # || 2 , and thus lim^oo K( x M) = 
Kx*. □ 



lim sup 



\Kx^\\ 2 - \\Kx*\\< 



Lemma 5.13 Under the same assumptions as in Proposition 15.11] there exists a subse- 
quence [x^'t 1 ) of (x^) nG M such that 
\ Jem 



lim \\x^ - x*\\ = 0, 



£->oo 



(41) 



Proof: Let (a;( nj '))j g N be the subsequence defined in the proof of Proposition 15 . 1 ll Define 
now := x( n ^ — x* and := x^ nj+1 -* — x*. Since, by Lemma 15.101 ||a;( n+1 ) — 



x 



Ml 



, 0, we have ||it^ — v^'\ 



0. On the other hand, 



U U) + x # - w R (u® +x* + (3 (n ^K*(y - K(u® + x*)) 
u ij) + F R (x* + f3^K*(y - Kx*)} 

-F R (x* + P (n ^K*(y - Kx*) + u {j) - f3 {n ^K*Ku^) , 



where we have used Proposition 15.111 (x* is a minimizer) and Lemma 15. II (so that 

x* = F R (x* + (3^K*(y - Kx*))). By Lemma EJ21 lim^c \\Ku^\\ = 0. Since the 

p( n j) ar e uniformly bounded, we have, by formula (|14j) . 



> fl (x* + (3 {n ^K*{y - Kx*) + u ij) - f3 {n ^K*Ku ij) } 
F R (x* + (3 in ^K*{y - Kx*) + 
< 0^\\K*Ku^\\ — 1 



0) 



Combining this with llu^) — v^'W -. ► 0, we obtain 

O I II n ,/Yl 1 



lim 



> R (x* + (3 {n ^K*{y - Kx*) + u {j) 



R (x* + (3^K*(y-Kx*) 



0. 



(42) 
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Since the f3^ n ^ are uniformly bounded, they must have at least one accumulation point. 
Let f3(°°' be such an accumulation point, and choose a subsequence (jV)feN such that 
lim^oo f}( n u) = 0(°°\ To simplify notation, we write n' e := n je , u'^ := u^ l \ v'^ := v^ l \ 
We have thus 

lim^oo /?K) = , and 

lim^oo F R (x# + 0( n dK*(y- Kx*) + u'W\ (43) 
- F R [x* + P^K* (y - Kx#)^j - u'W = 0. 

Denote h* := x* + P^K*{y - Kx#) and h'W := x# + pW)K*(y - Kx*). We have now 
\\F R (h# + u'W) -F R {h#) - u'M\\ 

< \\F R (ti®+uM)-F R (h'®)-uM\\ 

+ \\F R (tiW+u'W)-F R (h# + u'®)\\ + \\F R (ti^)-F R (h#)\\ 

< \\F R (h'W + u'W) - F R (h'W) - u'W|| + 2||fc'W - /i # ||. 

Since both terms on the right hand side converge to zero for £ — > 00 (see (I43p ). we have 

lim ||Pr(/i # + - P fl (/i # ) - || = 0. (44) 

£—►00 

Without loss of generality we can assume > R. By Lemma 14.21 there exists // > 

such that F R (h#) = § M (/i#). Because \tif\ — > as |A| — > 00, this implies that, for some 
finite K\ > 0, (F R (h#)) x = for |A| > K\. Pick now any e > that satisfies e < /x/5. 

There exists a finite > so that J2\\\>k 2 I^a I 2 < g2 - Set ^0 := max (-Kd) -^2); and 
define the vector h* by h* = h* if |A| < K , h* = if |A| > iT - 

By the weak convergence of the u'^\ we can, for this same Kq, determine L\ > such 
that, for all I > Lj, ^|A]<Ko l u > , P - g2 - Define new vectors u'^ by = if [A| < Kq, 
u'W= U fii\\\>K . " 

Because of (JSJ, there exists L 2 > such that ||Pj?(/i # + u'W) - Pj?(/i # ) - u'M || < e 
for I > L<i- Consider now £ > L : = max(Li, L2). We have 

\\ ¥R (h# + u>W)-F R (h#)-u'W\\ 
< \\F R (h# + u'W)-F R (h* + u'W)\\ + \\F R {h* + u'^) -F R (h* + u'^)\\ 



+ \\F R (h# + u'^) - F R (h#) - u'®\\ + \\F R (h#) - F R (h#)\\ + 



5e 



On the other hand, Lemma 14.21 tells us that there exists og > such that F R (h& + 
u'W) = E> ae (h# + u'^>) = § ai (h#) + S ai (u'^), where we used in the last equality that 
hf = for |A| > K and uf ] = for |A| < K . From p^h*)^ = R = p^h*)^ + 
1 1 So* )||l we conclude that ag > /x for all £ > L. We then deduce 

(5e) 2 > ||p fl (fc# + fi'W)-P Jl (^#)-fi'Wf 



= Yl \s„M)-sM)\ 2 + E 

|A|<iCo |A|>if 

> [max(|nf|-^0)-|nf |] 2 

\\\>K 



^ min ( \,(Te) > ^ min (|«a^ I ' ^ 

|A]>K- |A|>i<:o 
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Because we picked e < fj,/5, this is possible only if \uf®\ 
if, in addition, 



< jJL for all |A| > Kq, t > L, and 



E 

|A|>AT 



1/2 



< 5e , 



i.e. 



< 5e 



(45) 



It then follows that llu'^l 



< 



+ 



E 



|A|<X l U A 



'Ml 2 



1/2 



< 6e. 



We have thus obtained what we set out to prove: the subsequence (x n ^)^ gN of (a^ n ') n eN 
satisfies that, given arbitrary e > 0, there exists L so that, for i > L, \\x nj t — x&\\ < 6e. □ 
Remark 5.14 In this proof we have implicitly assumed that + u^\\i > R. Given 
that || bft ||i > R, this assumption can be made without loss of generality, because it is 



not possible to have ||/t*||i > R and ||/t* + 



,.0) I 



< R infinitely often, as the following 



argument shows. Find Kq, Lq such that E|AI<-Kn I^A I — (ll^lli +-^)/^ anc ^' ^ — -^0 an d 
V|A| < 2T : \u 



'Mi 



< (^(ll^lll " «)/4- Then £ |A|<i , o I^A + u \ I > £ 



/|A|<Jf 1 A I 



J/i # ||i + i?)/2 - (||/i # ||i - fl)/4 = + (||/i # ||i - fl)/4 > fl. Hence, W > L , 
||/i # + > i?. □ 

Remark 5.15 At the cost of more technicalities it is possible to show that the whole 
subsequence (x( nj '))j g N defined in the proof of Proposition 15.111 converges in norm to x#, 
i.e., that lim^oo ||x^ n ^ — x#\\ = 0, without going to a subsequence (x nj () i&N . □ 
The following proposition summarizes in one statement all the findings of the last two 
subsections. 



Proposition 5.16 (Norm convergence to minimizing accumulation points) Every 
weak accumulation point x# of the sequence (x( n )) ng N defined by (jllh is a minimizer of 
T> in Br. Moreover, there exists a subsequence (x^^gn of (x( n )) ng pj that converges to 
x^ in norm. 



5.4 Uniqueness of the accumulation point 

In this subsection we prove that the accumulation point x# of (x( n )) ng N is unique, so 
that the entire sequence (x( ra )) ng N converges to x* in norm. (Note that two sequences 
(x( ra )) ng pj and (x'^) ng N> both defined by the same recursion, but starting from different 
initial points x^ ^ ^ x'^°\ can still converge to different limits x* and x'*.) 
We start again from the inequality 

(x (n) + 0^K*(y - Kx in) ) - x in+1 \w - x (n+1) ) < 0, (46) 

for all w € Br and for all n £ N, and its many consequences. Define Mr to be the set of 
minimizers of T> on Br. By Lemma |5.2[ Mr = Br n (x + keriT), where x is an arbitrary 
minimizer of T> in Br. By the convention adopted in Remark 15.11 

Mr C B+ := I x e ti(A);x\ > for all A <E A, and ^ x A < -R > . (47) 
I AeA J 

Moreover, for each element z € Mr, z\ = if A ^ T (see Lemma [53]). The set Mr is both 
closed and convex. We define the corresponding (nonlinear) projection operator ^m r as 
usual, 

F Mr (v) := arg min{||w - z|| 2 ; z G Mr } . (48) 
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Because Mr is convex, this projection operator has the following property: 

Vx £ Mr : (z - F Mr (z),x - P Mr (z)) < 0. (49) 

(The proof is standard, and is essentially given in the proof Lemma [4. 3 ( where in fact only 
the convexity of Br was used.) For each re £ N, we introduce now and defined by 



o(") := F Mr {x^), feW = - a {n \ 
Specializing equation (|49p to x^ n \ we obtain, for all x £ Mr and for all re £ N: 

( x (») - W,£- W) < 0. 



or 



(&W,x-a< n >) < 0. 

Because a^ n ' is a minimizer, we can also apply Lemma 15. II to and conclude 
(K*(y - Ka {n) ), w - a (n) ) < 0, for all w £ Br. 
With these inequalities, we can prove the following crucial result. 
Lemma 5.17 For any x £ Mr, and for any n £ N, 

|| x («+i) - x\\ < \\xW -x\\. 
Proof: We set w = x in (06|), leading to 

(x (n) - x (n+1 \ x - x( n+ V) + p {n) (K*(y - Kx^), -b {n+l) ) < 0, 



(50) 

(51) 
(52) 

(53) 



(54) 



(55) 



where we have used that Kx = Ka^ n+1 \ We also have, setting w = x( n+1 ) in the (re + In- 
version of ([53]) . 

(K*(y - Ka^),x {n+l) - a {n+l) ) < 0, 



or 



(K*{y-Ka^)M n+1) ) < 0, 
where we have used Ka^ = Ka^ n+1 \ It follows that 

(xW -x (n+1) ,x-x (n+1) ) + (3 {n) {-K*Kb {n \-b^ n+ ^) < 0, 



or 



(x (n) -x (n+1 \x-x {n+1) ) + (3^ {Kb( n \ Kb {n+1) ) < 0, 
which is also equivalent to 

(iW - x, x - x( n+1 )) + ||x - x( n+1 ) || 2 + [||itt/ n ) || 2 + ||K6( n+1 ) |p 

- i/3^||K6( n ) -K6( n+1 )|| 2 < 0. 
Adding ±/3( n ) |[if(6( n ) -&( n+1 ))|| 2 < §||x( n ) -x( n+1 )|| 2 to ([60]), we have 



(56) 
(57) 

(58) 
(59) 



(60) 



(X (n) -X,X-2> 



+ ||£ _ ||2 + ^(n) J|| K6 (n) ||2 + ^(n+l) ||2 



~{n) _ An 



— I|X V "' — X 

..(«) M|2 



2 
r 

2 L 



4-D|p 



,r| - -|- ||x( n+1 ) - X|| 2 - 2{lW - X,x( n+1 ) - X) 
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It follows that 



(l - ||x( n+1 ) - x|| 2 + (1 - r)(x - X ( n \x( n+ V -x)-L\\x- x^f 



<-V n) !l^6 (n) || 2 + ||^6 (n+1) || 2 
which, in turn, implies that 

(l-~) j|x (n+1) - x\\ 2 - (l - r)\\i - x in) \\\\x( n+ V -x\ 

This can be rewritten as 



< 0, 



r 



-||x-x (n) || 2 < 0. 



\x — X 



(n+l)\ 



\x — X 



Ml 



x H — x — x 



Ml 



<0, 



which implies \\x 



(n+l) 



a; I. 



(61) 
(62) 

(63) 

□ 



We are now ready to state the main result of our work. 



Theorem 5.18 The sequence { x ) ne ^ as defined in (jlip . where the step-length sequence 

(^ n ^)neN sa ^ s fi es Condition (B) with respect to the x^ n \ converges in norm to a minimizer 
ofV on B R . 

Proof: The sequence (x^-^neN has a least one accumulation point x#. By Proposition 
15.111 x^ minimizes T> in Br. By Proposition 15.161 (x^ n ') ne ^ has a subsequence )^ g ^ 
that converges to x*. By Lemma l5.17l \\x^ — x*|j decreases monotonically, hence it has 
a limit for n — > oo, and 



lim ||x^ — x # || = lim ||x^ n ^ — x*| 
n— >oo >oo 



0. 



(64) 
□ 



6 Numerical Experiments and Additional Algorithms 

6.1 Numerical examples 

We conduct a number of numerical experiments to gauge the effectiveness of the different 
algorithms we discussed. All computations were done in Mathematica 5.2 [36] on a 2Ghz 
workstation with 2Gb memory. 

We are primarily interested in the behavior, as a function of time (not number of 
iterations), of the relative error ||x( n ) — x||/||x||. To this end, and for a given operator K 
and data y, we need to know in advance the actual minimizer x(r) of the functional ©. 

One can calculate the minimizer exactly (in practice up to computer round-off) with 
a finite number of steps using the LARS algorithm described in [29] (the variant called 
'Lasso', implemented independently by us). This algorithm scales badly, and is useful in 
practice only when the number of non-zero entries in the minimizer x(r) is sufficiently 
small. We made our own implementation of this algorithm to make it more directly appli- 
cable to our problem (i.e., we do not renormalize the columns of the matrix to have zero 
mean and unit variance, as it is done in the statistics context |29j). We also double-check 
the minimizer obtained in this manner by verifying that it is indeed a fixed point of the 
iterative thresholding algorithm (|7|) (up to machine epsilon). We then have an 'exact' 
minimizer x together with its radius it! = ||x||i (used in the projected algorithms) and, 
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according to Lemma 15.31 the corresponding threshold r = maxj |rj| with r = K*(y — Kx) 
(used in the iterative thresholding algorithm). 

The numerical examples below are listed in order of increasing complexity; they il- 
lustrate that the algorithms can behave differently for different examples. In these ex- 
periments we choose (3^ = (5^ '■= \\A n ^ \\ 2 /\\KA n ^ || 2 , (where, as before, = K*(y — 
Kx^)); (3^ is the standard descent parameter from the classical linear steepest descent 
algorithm. 

1. When K is a partial Fourier matrix (i.e., a Fourier matrix with a prescribed num- 
ber of deleted rows), there is no advantage in using a dynamical step size = 
||r( n ) |[ 2 /||ETr^ n ) || 2 as this ratio is always equal to 1. This trivially fulfills Condition 
(B) in Section 15.11 The performance of the projected steepest descent iteration 
simply equals that of the projected Landweber iterations. 

2. By combining a scaled partial Fourier transform with a rank 1 projection operator, 
we constructed our second example, in which K is a 1536 x 2049 matrix, of rank 
1536, with largest singular value equal to 0.99 and all the other singular values be- 
tween 0.01 and 0.11. Because of the construction of the matrix, the FFT algorithm 
provides a fast way of computing the action of this matrix on a vector. For the 
y and r that were chosen, the limit vector x T has 429 nonzero entries. For this 
example, the LARS procedure is slower than thresholded Landweber, which in turn 
is significantly slower than projected steepest descent. To get within a distance of 
the true minimizer corresponding to a 5% relative error, the projected steepest de- 
scent algorithm takes 2 sec, the thresholded Landweber algorithm 39 sec, and LARS 
151 sec. (The relatively poor performance of LARS in this case is due to the large 
number of nonzero entries in the limit vector x T ; the complexity of LARS is cubic in 
this number of nonzero entries.) In this case, the & { t n) = ||r( n )|| 2 /||i^r( n )|| 2 are much 
larger than 1; moreover, they satisfy Condition (B) of Section f5.il at every step. We 
illustrate the results in Figure 
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Figure 3: The different convergence rates of the thresholded Landweber algorithm (dotted 
line), the projected steepest descent algorithm (solid line, near vertical axis) and the LARS 
algorithm (dashed line), for the second example. The projected steepest descent algorithm 
converges much faster than the thresholded Landweber iteration. They both do better 
than the LARS method. 

3. The last example is inspired by a real-life application in geoscience [38J, in particular 
an application in seismic tomography based on earthquake data. The object space 
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Figure 4: The different convergence rates of the thresholded Landweber algorithm (solid 
line), the projected Landweber algorithm (dashed line) and the projected steepest descent 
algorithm (dotted line), for the third example. The projected steepest descent algorithm 
converges about four times faster than the thresholded Landweber iteration. The projected 
Landweber iteration does better at first (not visible in this plot), but looses with respect 
to iterative thresholding afterwards. The horizontal axis has time (in hours), the vertical 
axis displays the relative error. 

consists of the wavelet coefficients of a 2D seismic velocity perturbation. There 
are 8192 degrees of freedom. In this particular case the number of data is 1848. 
Hence the matrix K has 1848 rows and 8192 columns. We apply the different 
methods to the same noisy data that are used in [38] and measure the time to 
convergence up to a specified relative error (see Table [Q and Figured!). This example 
illustrates the slow convergence of the thresholded Landweber algorithm (|T|), and the 
improvements made by a projected steepest descent iteration (fTTj) with the special 
choice /?( n ) = [3^ above. In this case, this choice turns out not to satisfy Condition 
(B) in general. One could conceivably use successive corrections, e.g. by a line- 
search, to determine, starting from f3^\ values of that would satisfy condition 

(B), and thus guarantee convergence as established by Theorem 5.18. This would 

(n) 

slow down the method considerably. The /3 st . seem to be in the right ballpark, 
and provide us with a numerically converging sequence. We also implemented the 
projected Landweber algorithm (fl~0|) ; it is listed in Tableland illustrated in Figure 

m ' 

The matrix K in this example is extremely ill-conditioned: its largest singular value 
was normalized to 1, but the remaining singular values quickly tend to zero. The 
threshold was chosen, according to the (known or estimated) noise level in the data, 
so that V(x)/a 2 = 1848 ( = the number of data points), where a is the data noise 
level; this is a standard choice that avoids overfitting. 

In Figure [H we see that the thresholded Landweber algorithm takes more than 21 
hours (corresponding to 200, 000 iterations) to converge to the true minimizer within 
a 3% relative error, as measured by the usual l<i distance. The projected steepest 
descent algorithm is about four times faster and reaches the same reconstruction 
error in about 5.5 hours (25,000 iterations). Due to one additional matrix- vector 
multiplication and, to a minor extent, the computation of the projection onto an 
£i-ball, one step in the projected steepest descent algorithm takes approximately 
twice as long as one step in the thresholded Landweber algorithm. For the projected 
Landweber algorithm there is an advantage in the first few iterations, but after 
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Table 1: Table illustrating the relative performance of three algorithms: thresholded 
Landweber, projected Landweber and projected steepest descent, for the third example. 



a short while, the additional time needed to compute the projection P/j (i.e., to 
compute the corresponding variable thresholds) makes this algorithm slower than 
the iterative soft-thresholding. We illustrate the corresponding CPU time in Table 

m 

It is worthwhile noticing that for the three algorithms the value of the functional 
([5]) converges much faster to its limit value than the minimizer itself: When the 
reconstruction error is 10%, the corresponding value of the functional is already 
accurate up to three digits with respect to the value of the functional at x. We can 
imagine that in this case the functional has a long narrow "valley" with a very gentle 
slope in the direction of the eigenvectors with small (or zero) singular values. The 
path in the vs. \\Kx — y\\ 2 plane followed by the iterates is shown in Figure 
[U The projected steepest descent algorithm, by construction, stays within a fixed 
£i-ball, and, as already mentioned, converges faster than the thresholded Landweber 
algorithm. The path followed by the LARS algorithm is also pictured. It corresponds 
with the so-called trade-off curve which can be interpreted as the border of the area 
that is reachable by any element of the model space, i.e., it is generated by x(r) for 
decreasing values of r > 0. 

In this particular example, the number of nonzero components of x equals 128. The 
LARS (exact) algorithm only takes 55 seconds, which is much faster than any of the 
iterative methods demonstrated here. However, as illustrated above, by the second 
example, LARS looses its advantage when dealing with larger problems where the 
minimizer is not sparse in absolute number of entries, as is the case in, e.g., realistic 
problems of global seismic tomography. Indeed, the example presented here is a "toy 
model" for proof-of-concept for geoscience applications. The 3D model will involve 
millions of unknowns and solutions that may be sparse compared with the total 
number of unknowns, but not sparse in absolute numbers. Because the complexity 
of LARS is cubic in the number of nonzero components of the solution, such 3D 
problems are expected to lie beyond its useful range. 

6.2 Relationship to other methods 

The projected iterations (fl~6|) and (fT7]) are related to the POCS (Projection on Convex 
Sets) technique [3]. The projection of a vector a on the solution space {x : Kx = y} (a 
convex set, assumed here to be non-empty; no such assumption was made before because 
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the functional © always has a minimum) is given by: 



x = a- K*(KK*)-\y - Ka) (65) 

Hence, alternating projections on the convex sets {x : Kx = y} and Br give rise to the 
algorithm [5j: : Pick an arbitrary x^ £ £2 (A), for example x^ = 0, and iterate 

x (n+i) = p fl ( x (n) _ K*(KK*)~ 1 (y - KxW)) (66) 

This may be practical in case of a small number of data or when there is structure in K, 
i.e., when KK* is efficiently inverted. Approximating KK* by the unit matrix, yields 
the projected Landweber algorithm ()16|) : approximating (KK*)" 1 by a constant multiple 
of the unit matrix yields the projected gradient iteration (|17j) if one chooses the constant 
equal to . The projected methods discussed in this paper produce iterates that (except 




Figure 5: Trade-off curve (solid line) and its approximation with algorithm (|67p in 200 
steps (dashed line). For comparison, the iterates of projected steepest descent are also 
indicated (triangles). 

for the first few) live on the 'skin' of the £i-ball of radius R, as shown in Fig. [TJ We have 
found even more promising results for an 'interior' algorithm in which we still project on 
■^i-balls, but now with a slowly increasing radius, i.e., 

x (n+i) = ^(n) + p(n) r Wi\ j R (n) = ( n + l)R/N, and n = 0, . . . , N, (67) 

where N is the prescribed maximum number of iterations (the origin is chosen as the 
starting point of this iteration). We do not have a proof of convergence of this 'interior 
point type' algorithm. We observed (also without proof) that the path traced by the 
iterates x^ (in the ||ac||i vs. \\Kx — y\\ 2 plane) is very close to the trade-off curve (see Fig. 
5); this is a useful property in practice since at least part of the trade-off curve should be 
constructed anyway. 

Note that the strategy followed by these algorithms is similar to that of LARS |29j . in 
that they both start with x^ = and slowly increase the i\ norm of the successive ap- 
proximations. It is also related to |36j . 

While we were finishing this paper, Michael Friedlander informed us of their numerical 
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results in [15] which are closely related to our approach, although their analysis is limited 
to finite dimensions. 

Different, but closely related is also the recent approach by Figueiredo, Nowak, and 
Wright [32] . The authors first reformulate the minimization of ([5]) as a bound-constrained 
quadratic program in standard form, and then they apply iterative projected gradient iter- 
ations, where the projection act componentwise by clipping to zero negative components. 

7 Conclusions 

We have presented convergence results for accelerated projected gradient methods to find 
a minimizer of an i\ penalized functional. The innovation due to the introduction of 'Con- 
dition (B)' is to guarantee strong convergence for the full sequence. Numerical examples 
confirm that this algorithm can outperform (in terms of CPU time) existing methods such 
as the thresholded Landweber iteration or even LARS. 

It is important to remark that the speed of convergence may depend strongly on how 
the operator is available. Because most of the time in the iterations is consumed by 
matrix- vector multiplications (as is often the case for iterative algorithms), it makes a 
big difference whether K is given by a full matrix or a sparse matrix (perhaps sparse in 
the sense that its action on a vector can be computed via a fast algorithm, such as the 
FFT or a wavelet transform). The applicability of the projected algorithms hinges on 
the observation that the £2 projection on an i\ ball can be computed with a O(mlogm)- 
algorithm, where m is the dimension of the underlying space. 

There is no universal method that performs best for any choice of the operator, data, 
and penalization parameter. As a general rule of thumb we expect that, among the 
algorithms discussed in this paper for which we have convergence proofs, 

• the thresholded Landweber algorithm (|7|) works best for an operator K close to the 
identity (independently of the sparsity of the limit), 

• the projected steepest descent algorithm (jlip works best for an operator with a 
relatively nice spectrum, i.e., with not too many zeroes (also independently of the 
sparsity of the minimizer), and 

• the exact (LARS) method works best when the minimizer is sparse in absolute terms. 

Obviously, the three cases overlap partially, and they do not cover the whole range of 
possible operators and data. In future work we intend to investigate algorithms that 
would further improve the performance for the case of a large ill-conditioned matrix and 
a minimizer that is relatively sparse with respect to the dimension of the underlying 
space. We intend, in particular, to focus on proving convergence and other mathematical 
properties of ([67|) . 
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